Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
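For quick experimentation with the released checkpoints, here is a minimal usage sketch; it assumes the models are available on the Hugging Face Hub (the full bigscience/bloom checkpoint and smaller released variants such as bigscience/bloom-560m) and uses the standard transformers loading API.

```python
# Minimal sketch: load a released BLOOM checkpoint and generate a continuation.
# Assumes the checkpoints are published on the Hugging Face Hub; the smaller
# variant used here fits on a single GPU, swap in "bigscience/bloom" if you
# have the hardware for the full 176B model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "BLOOM is a 176B-parameter open-access language model that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```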
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone. In the process of building BLOOM, the BigScience Large Open-science Open-access Multilingual language model, our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience.
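As a rough illustration of the budget arithmetic behind choosing a model size, the sketch below converts a fixed GPU-hour budget into a total FLOP count and pairs candidate model sizes with the number of training tokens they allow, using the common C ≈ 6ND approximation; the per-GPU peak throughput and utilization figures are assumptions for illustration, not measurements from the paper.

```python
# Rough budget arithmetic sketch (illustrative only): convert a GPU-hour budget
# into trainable tokens for candidate model sizes via C ~= 6 * N * D.
# The peak throughput and utilization below are assumptions, not paper values.

GPU_HOURS = 1_000_000          # stated compute budget
PEAK_TFLOPS = 312              # A100 BF16 dense peak (assumed)
UTILIZATION = 0.40             # assumed fraction of peak actually achieved

total_flops = GPU_HOURS * 3600 * PEAK_TFLOPS * 1e12 * UTILIZATION

for n_params in (13e9, 70e9, 176e9):           # candidate model sizes
    tokens = total_flops / (6 * n_params)      # D ~= C / (6 * N)
    print(f"{n_params / 1e9:>5.0f}B params -> ~{tokens / 1e9:,.0f}B tokens")
```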
Energy-based models (EBMs) allow for extremely flexible specifications of probability distributions. However, they do not provide a mechanism for obtaining exact samples from these distributions. Monte Carlo techniques can help us obtain samples, provided some proposal distribution that we can easily sample from is available. For instance, rejection sampling can provide exact samples, but it is often difficult or impossible to apply because it requires a proposal distribution that upper-bounds the target distribution. Approximate Markov chain Monte Carlo sampling techniques are usually easier to design, exploiting local proposal distributions that perform local edits on an evolving sample. However, due to the local nature of the proposal distribution, these techniques can be inefficient and provide no estimate of sample quality. In this work, we propose a new approximate sampling technique, quasi-rejection sampling (QRS), which allows a trade-off between sampling efficiency and sampling quality while providing explicit convergence bounds and diagnostics. QRS capitalizes on the availability of high-quality global proposal distributions obtained from deep learning models. We demonstrate the effectiveness of QRS sampling on discrete EBMs over text for the tasks of controlled text generation with distributional constraints and paraphrase generation. We show that sampling from these EBMs is possible at a cost in sampling efficiency.
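To make the efficiency/quality trade-off concrete, here is a minimal sketch of a quasi-rejection-style sampler, assuming an unnormalized target score P(x) (the EBM) and a global proposal q that we can both sample from and score; the acceptance rule min(1, P(x) / (β·q(x))) and the role of β as the trade-off knob follow the description above, but the function names and interfaces are illustrative rather than the paper's exact algorithm.

```python
import math
import random

def quasi_rejection_sample(log_P, log_q, draw_from_q, beta, n_draws):
    """Sketch of a quasi-rejection-style sampler (illustrative).

    log_P:       unnormalized log-score of the target EBM, log P(x)
    log_q:       log-probability of the global proposal, log q(x)
    draw_from_q: callable returning one sample x ~ q
    beta:        trade-off constant; larger beta -> better sample quality
                 but lower acceptance rate (exact once beta * q >= P everywhere)
    """
    accepted = []
    for _ in range(n_draws):
        x = draw_from_q()
        # accept with probability min(1, P(x) / (beta * q(x)))
        accept_prob = math.exp(min(0.0, log_P(x) - math.log(beta) - log_q(x)))
        if random.random() < accept_prob:
            accepted.append(x)
    acceptance_rate = len(accepted) / n_draws  # diagnostic: efficiency side of the trade-off
    return accepted, acceptance_rate
```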
Machine learning is shifting towards general-purpose, pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a wide range of tasks. However, due to their generic training methodology, these models often fail to meet some downstream requirements (e.g., hallucinations in abstractive summarization or wrong formatting in automatic code generation). This raises the important question of how to adapt pretrained generative models to new tasks without destroying their capabilities. Recent work has suggested solving this problem by representing task-specific requirements as energy-based models (EBMs) and approximating these EBMs using distributional policy gradients (DPG). Unfortunately, this approach is limited to unconditional distributions, represented by unconditional EBMs. In this paper, we extend this approach by proposing conditional DPG (CDPG). We evaluate CDPG on three different control objectives across two tasks: summarization with T5 and code generation with GPT-Neo. Our results show that fine-tuning with CDPG robustly moves these pretrained models closer to meeting the control objectives and, in contrast with baseline approaches, does not result in catastrophic forgetting.
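The sketch below gives a conceptual, simplified picture of a distributional-policy-gradient-style update in the conditional setting: for each context, sequences are sampled from the current policy, scored by the unnormalized EBM, and the policy's log-likelihood gradient is reweighted by self-normalized importance weights. The policy interface, EBM definition, and partition estimate here are hypothetical simplifications for illustration, not the paper's exact algorithm.

```python
import torch

def cdpg_style_step(policy, contexts, ebm_score, optimizer, samples_per_context=8):
    """One conceptual CDPG-style update (illustrative, simplified).

    policy:     hypothetical object with .sample(c, n) -> list of sequences and
                .log_prob(c, x) -> differentiable log pi_theta(x | c)
    ebm_score:  callable a(c, x) >= 0, the unnormalized target score (e.g. the
                pretrained model's probability times a constraint indicator)
    """
    loss = 0.0
    for c in contexts:
        xs = policy.sample(c, samples_per_context)
        logps = torch.stack([policy.log_prob(c, x) for x in xs])
        with torch.no_grad():
            # importance weights a(c, x) / pi_theta(x | c)
            weights = torch.tensor([ebm_score(c, x) for x in xs]) / logps.detach().exp()
            z_hat = weights.mean().clamp_min(1e-12)   # crude per-context partition estimate
            weights = weights / z_hat
        # move pi_theta(. | c) towards p_c(x) = a(c, x) / Z_c
        loss = loss - (weights * logps).mean()
    optimizer.zero_grad()
    (loss / len(contexts)).backward()
    optimizer.step()
```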
The power of natural language generation models has sparked interest in automatic methods for detecting whether a piece of text was written by a human or a machine. So far, the problem has been framed in a standard supervised way: training a classifier on annotated data to predict the origin of a given new document. In this paper, we frame the problem in an unsupervised and distributional way: we assume access to a large collection of unannotated documents, a large fraction of which are machine-generated. We propose a method to detect those machine-generated documents by leveraging repeated higher-order n-grams, which we show occur more often in machine-generated text than in human-written text. This weak signal is the starting point of a self-training setup in which pseudo-labeled documents are used to train an ensemble of classifiers. Our experiments show that leveraging this signal allows us to rank suspicious documents accurately. Precision at 5000 is over 90% for the top-k sampling strategy and over 80% for nucleus sampling, using the largest model we tested (GPT-2 large). The drop as model size increases is small, which may indicate that the results carry over to other current and future large language models.
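As a concrete illustration of the weak signal, the sketch below scores a document by the fraction of its higher-order n-grams that occur more than once and ranks documents by that score; the n-gram order and scoring details are illustrative choices, not the paper's exact procedure.

```python
from collections import Counter

def repeated_ngram_rate(tokens, n=8):
    """Fraction of order-n n-grams in a document that appear more than once.

    Machine-generated text tends to repeat higher-order n-grams more often than
    human-written text, so a higher rate is (weak) evidence of machine generation.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# Example: rank documents by the signal; the most repetitive ones are flagged as suspicious.
docs = {"doc_a": "the cat sat on the mat the cat sat on the mat".split(),
        "doc_b": "a completely different sentence with no repetition at all".split()}
ranked = sorted(docs, key=lambda d: repeated_ngram_rate(docs[d], n=4), reverse=True)
print(ranked)
```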